Learning Unsupervised Visual Grounding Through Semantic Self-Supervision
نویسندگان
چکیده
Localizing natural language phrases in images is a challenging problem that requires joint understanding of both the textual and visual modalities. In the unsupervised setting, lack of supervisory signals exacerbate this difficulty. In this paper, we propose a novel framework for unsupervised visual grounding which uses concept learning as a proxy task to obtain self-supervision. The simple intuition behind this idea is to encourage the model to localize to regions which can explain some semantic property in the data, in our case, the property being the presence of a concept in a set of images. We present thorough quantitative and qualitative experiments to demonstrate the efficacy of our approach and show a 5.6% improvement over the current state of the art on Visual Genome dataset, a 5.8% improvement on the ReferItGame dataset and comparable to state-of-art performance on the Flickr30k dataset.
منابع مشابه
On supervision and statistical learning for semantic multimedia analysis
Media analysis for video indexing is witnessing an increasing influence of statistical techniques. Examples of these techniques include the use of generative models as well as discriminant techniques for video structuring, classification, summarization, indexing, and retrieval. There is increasing emphasis on reducing the amount of supervision and user interaction needed to construct and utiliz...
متن کاملAnalysis of Audio-Visual Features for Unsupervised Speech Recognition
Research on “zero resource” speech processing focuses on learning linguistic information from unannotated, or raw, speech data, in order to bypass the expensive annotations required by current speech recognition systems. While most recent zero-resource work has made use of only speech recordings, here, we investigate the use of visual information as a source of weak supervision, to see whether ...
متن کاملGrounding Language in Descriptions of Scenes
The problem of how abstract symbols, such as those in systems of natural language, may be grounded in perceptual information presents a significant challenge to several areas of research. This paper presents the GLIDES model, a neural network architecture that shows how this symbol-grounding problem can be solved through learned relationships between simple visual scenes and linguistic descript...
متن کاملConfidence Driven Unsupervised Semantic Parsing
Current approaches for semantic parsing take a supervised approach requiring a considerable amount of training data which is expensive and difficult to obtain. This supervision bottleneck is one of the major difficulties in scaling up semantic parsing. We argue that a semantic parser can be trained effectively without annotated data, and introduce an unsupervised learning algorithm. The algorit...
متن کاملConcept Grounding to Multiple Knowledge Bases via Indirect Supervision
We consider the problem of disambiguating concept mentions appearing in documents and grounding them in multiple knowledge bases, where each knowledge base addresses some aspects of the domain. This problem poses a few additional challenges beyond those addressed in the popular Wikification problem. Key among them is that most knowledge bases do not contain the rich textual and structural infor...
متن کامل